METHOD FOR IDENTIFYING RELATED
PAGES IN A HYPERLINKED DATABASE

FIELD OF THE INVENTION 

This invention relates generally to computerized information
retrieval, and more particularly to identifying related
pages in a hyperlinked database environment such as the
World Wide Web.

BACKGROUND OF THE INVENTION 

It has become common for users of host computers 
connected to the World Wide Web (the "Web") to employ 
Web browsers and search engines to locate Web pages 
having specific content of interest to users. A search engine, 
such as Digital Equipment Corporation's AltaVista search 
engine, indexes hundreds of millions of Web pages main- 
tained by computers all over the world. The users of the 
hosts compose queries, and the search engine identifies 
pages that match the queries, e.g., pages that include key 
words of the queries. These pages are known as a "result
set."

In many cases, particularly when a query is short or not 
well defined, the result set can be quite large, for example, 
thousands of pages. The pages in the result set may or may 
not satisfy the user's actual information needs. Therefore, 
techniques have been developed to identify a smaller set of 
related pages. 

In one prior art technique used by the Excite search
engine (see "http://www.excite.com"), users first form
an initial query, using the standard query syntax for the
Excite search engine, that attempts to specify a topic of
interest. After the result set has been returned, the user can
use a "Find Similar" option to locate related pages.
However, the finding of the related pages is not fully
automatic because the user is first required to form a query
before related pages can be identified. In addition, that
technique works only on the Excite search engine, and it
provides related pages only for the specific subset of Web
pages that are indexed by the Excite search engine.

In another prior art technique, an algorithm for connectivity
analysis of a neighborhood graph (n-graph) is
described by Kleinberg in "Authoritative Sources in a
Hyperlinked Environment," Proc. 9th ACM-SIAM Symposium
on Discrete Algorithms, 1998, and also in IBM
Research Report RJ 10076, May 1997; see
"http://www.cs.cornell.edu/Info/People/kleinber/auth.ps". The
Kleinberg algorithm analyzes the link structure, or connectivity,
of Web pages "in the vicinity" of the result set to
suggest useful pages in the context of the search that was
performed.

The vicinity of a Web page is defined by the hyperlinks 
that connect the page to others. A Web page can point to 
other pages, and the page can be pointed to by other pages. 
Close pages are directly linked; farther pages are indirectly
linked via intermediate pages. This connectivity can be
expressed as a graph where nodes represent the pages, and 
the directed edges represent the links. The vicinity of all the 
pages in the result set, up to a certain distance, is called the 
neighborhood graph. 

Specifically, the Kleinberg algorithm attempts to identify 
"hub" pages and "authority" pages in the neighborhood 
graph for a user query. Hubs and authorities exhibit a 
mutually reinforcing relationship. 

The Kleinberg paper cited above also describes an algorithm
that can be used to determine related pages by starting
with a single page. The algorithm works by first finding a set
of pages that point to the page, and then running the base
algorithm on the resulting graph. However, this algorithm
for finding related pages differs from our invention in that it
does not deal with popular URLs, with neighborhood graphs
containing duplicate pages, or with cases where the computation
is totally dominated by a single "hub" page, nor
does the algorithm include an analysis of the contents of
pages when it is computing the most related pages.

The CLEVER Algorithm is a set of extensions to Kleinberg's
algorithm, see S. Chakrabarti et al., "Experiments in
Topic Distillation," ACM SIGIR Workshop on Hypertext
Information Retrieval on the Web, Melbourne, Australia,
1998. The goal of the CLEVER algorithm is to distill the
most important sources of information from a collection of
pages about a topic.

In U.S. patent application Ser. No. 09/007,635 "Method
for Ranking Pages Using Connectivity and Content Analysis"
filed by Bharat et al. on Jan. 15, 1998, a method is
described that examines both the connectivity and the content
of pages to identify useful pages. However, the method
is relatively slow because all pages in the neighborhood
graph are fetched in order to determine their relevance to the
query topic. This is necessary to reduce the effect of
non-relevant pages in the subsequent connectivity analysis
phase.

In U.S. patent application Ser. No. 09/058,577 "Method
for Ranking Documents in a Hyperlinked Environment
using Connectivity and Selective Content Analysis" filed by
Bharat et al. on Apr. 9, 1998, now U.S. Pat. No. 6,112,203,
a method is described which performs content analysis on
only a small subset of the pages in the neighborhood graph
to determine relevance weights, and pages with low relevance
weights are pruned from the graph. Then, the pruned
graph is ranked according to a connectivity analysis. This
method still requires the result set of a query to form a query
topic.

Therefore, there is a need for a method for identifying
related pages in a linked database that does not require a
query and the fetching of many unrelated pages.

SUMMARY OF THE INVENTION

Provided is a method for identifying related pages among
a plurality of pages in a linked database such as the World
Wide Web. An initial page is selected from the plurality of
pages by specifying the URL of the page or clicking on the
page using a Web browser in a convenient manner.

Pages linked directly or indirectly to the initial page are
represented as a neighborhood graph in a memory. The
pages represented in the graph are scored on content using
a similarity measurement using a topic extracted from a
chosen subset of the represented pages.

A set of pages is selected from the pages in the graph, the
selected set of pages having scores greater than a first
predetermined threshold and not belonging to a predetermined
list of "stop URLs." Stop URLs are highly popular,
general purpose sites such as search engines. The selected
set of pages is then scored on connectivity, and a subset of
the set of pages that have scores greater than a second
predetermined threshold is selected as related pages.
Finally, during an optional pass, content analysis can be
done on highly ranked pages to determine which pages have
high content scores.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a hyperlinked environment 
that uses the invention; 
FIG. 2 is a flow diagram of a method according to the
invention.

DETAILED DESCRIPTION OF PREFERRED
EMBODIMENTS

System Overview

FIG. 1 shows a database environment 100 where the
invention can be used. The database environment is an
arrangement of client computers 110 and server computers
120 (generally "hosts") connected to each other by a network
130, for example, the Internet. The network 130
includes an application level interface called the World Wide
Web (the "Web") 131.

The Web 131 allows the clients 110 to access documents,
for example, multi-media Web pages 121 maintained by the
servers 120. Typically, this is done with a Web browser
application program (B) 114 executing in the client 110. The
location of each page 121 is indicated by an associated
Universal Resource Locator (URL) 122. Many of the pages
include "hyperlinks" 123 to other pages. The hyperlinks are
also in the form of URLs.

Although the invention is described with respect to documents
that are Web pages, it should be understood that our
invention can also be applied to any linked data objects of
a database whose content and connectivity can be characterized.

In order to help users locate Web pages of interest, a
search engine 140 can maintain an index 141 of Web pages
in a memory, for example, disk storage. In response to a
query 111 composed by a user using the Web browser (B)
114, the search engine 140 returns a result set 112 which
satisfies the terms (key words) of the query 111. Because the
search engine 140 stores many millions of pages, the result
set 112, particularly when the query 111 is loosely specified,
can include a large number of qualifying pages.

These pages may or may not be related to the user's
actual information need. Therefore, the order in which the
result set 112 is presented to the client 110 is indicative of
the usefulness of the search engine 140. A good ranking
process will return only "useful" pages before pages that are
less so.

We provide an improved ranking method 200 that can be
implemented as part of a search engine 140. Alternatively,
our method 200 can be implemented by one of the clients
110 as part of the Web browser 114. Our method uses
content analysis, as well as connectivity analysis, to improve
the ranking of pages in the result set 112 so that just pages
related to a particular topic are identified.

Our invention is a method that takes an initial single
selected Web page 201 as input, and produces a subset of
related Web pages 113 as output. Our method works by
examining the "neighborhood" surrounding the initial
selected page 201 in a Web neighborhood graph and examining
the content of the initial selected page and other pages
in the neighborhood graph.

Our method relies on the assumption that related pages
will tend to be "near" the selected page in the Web neighborhood
graph, or that the same keywords will appear as part
of the content of related pages. The nearness of a page can
be expressed as the number of links (K) that need to be
traversed to reach a related page.

FIG. 2 shows the steps of a method according to our
invention. As stated above, the method can be implemented
as a software program in either a client or server computer.
In either case, the computers 110, 120, and 140 include
conventional components such as a processor, memory, and
I/O devices that can be used to implement our method.

Building the Neighborhood Graph

We start with an initial single selected page 201, i.e., the
page 201 includes a topic which is of interest to a user. The
user can select the page 201 by, for example, giving the URL
or "clicking" on the page. It should be noted that the initial
selected page can be any type of linked data object, text,
video, audio, or just binary data as stated above.

We use the initial page 201 to construct 210 a neighborhood
graph (n-graph) 211 in a memory. Nodes 212 in the
graph represent the initial selected page 201 as well as other
closely linked pages, as described below, and edges 213
denote the hyperlinks between pages. The "size" of the
graph is determined by K which can be preset or adjusted
dynamically as the graph is constructed. The idea being that
the graph needs to represent a meaningful number of pages.

During the construction of the neighborhood graph 211,
the direction of links is considered as a way of pruning the
graph. In the preferred implementation, with K=2, our
method only includes nodes at distance 2 that are reachable
by going one link backwards ("B"), pages reachable by
going one link forwards ("F"), pages reachable by going one
link backwards followed by one link forwards ("BF") and
those reachable by going one link forwards and one link
backwards ("FB"). This eliminates nodes that are reachable
only by going forward two links ("FF") or backwards two
links ("BB").

To eliminate some unrelated nodes from the neighborhood
graph 211, our method relies on a list of "stop" URLs.
Stop URLs are URLs that are so popular that they are
frequently referenced from many, many pages, such as, for
instance, URLs that refer to popular search engines. An
example is "www.altavista.com." These "stop" nodes are
very general purpose and so are generally not related to the
specific topic of the selected page 201, and consequently
serve no purpose in the neighborhood graph. Our method
checks each URL against the stop list during the neighborhood
graph construction, and eliminates the node and all
incoming and outgoing edges if a URL is found on the stop
list.

In some cases, the neighborhood graph becomes too large.
For example, highly popular pages are often pointed to by
many thousands of pages and including all such pages in the
neighborhood graph is impractical. Similarly, some pages
contain thousands of outgoing links, which also causes the
graph to become too large. Our method filters the incoming
or outgoing edges by choosing only a fixed number M of
them. In our preferred implementation, M is 50. In the case
that the page was reached by a backwards link L, and the
page has more than M outgoing links, our method chooses
the M links that surround the link L on the page.

In the case of a page P that has N pages pointing to page
P, our method will choose only a subset of M of the pages
for inclusion in the neighborhood graph. Our method
chooses the subset of M pages from the larger set of N
pages pointing to page P by selecting the subset of M pages
with the highest in-degree in the graph. The idea being that
pages with high in-degree are likely to be of higher quality
than those with low in-degree.

In some cases, pages will have identical content, or nearly
identical content. This can happen when pages were copied,
for example. In such cases, we want to include only one such
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page in our neighborhood graph, since the presence of neighborhood graph construction at smaller values of K 

multiple copies of a page will tend to artificially increase the when pages are reached that have very little content in 

importance of any pages that the copies point to. We collapse common with the original page, 
duplicate pages to a single node in the neighborhood graph. 

There are several ways that one could identify duplicate 5 Connectivity Scoring the Pruned Graph 

P a&es * In step 240, the pruned graph 216 is scored again, this 

One way examines the textual content of the pages to see time based on connectivity. This scoring effectively ranks 

if they are duplicates or near-duplicates, as described by lhe pageSj and pages above a predetermined rank can be 

Broder et al. in U.S. patent application Ser. No. 09/048,653, presented to the user as the related pages 113. 

"Method for clustering closely resembling data objects" 10 ^ . ... ... - iL . . . A , T „ . 

Ai a \a mno it c d * kt < iTn 1 a *u O ne algorithm which performs this sconng is the Klein- 
filed Mar. 26, 1998, now U.S. Pat. No. 6,119,124. Another . . * * . , « . , , 

t . , . . * *• ii • i l. • i_ j bere algorithm mentioned previously. This aleonthm works 

way that is less computationally expensive and which does , x °. . -5 c T. j ■ 

/ . , * * r • * .i_ by iteratively computing two scores for each node in the 

not require the content of the page, is to examine the , , , J n™\%Ai j *i_ ^ 

, . ,. , a. xr.L * ve . L graph: a hub score (HS) 241 and an authority score 242. The 

outgoing links or two pages. It there are a significant number f . , jul c i 

c . • ii- j.u *i • i i ,i 1C hub score 241 estimates good hub pages, for example, a page 

of outgoing links and they are mostly identical, these pages 15 , ,. . i * 

vi i j I j v * vi7 • j . / V . • such as a directory that points to many other relevant pages, 

are likely to be duplicates. We identify this case by choosing « iL J t J , tU . t r ° r 

/ u f r v i ^ o im j im The authority score 242 estimates good authority pages, for 

a threshold number of links Q. Pages PI and P2 are . 7 . u . * , t . 7 y & ' 

■ j j j *• . - fU ,u m j m u *u example, a page that has relevant information, 

considered near duplicates if both PI and P2 have more than r r ° 

Q links, and a large fraction of their links are present in both ^ intuition behind Kleinberg's algorithm is that a good 

PI and P2 20 nub ^ one ^ at P oults t0 many documents and a good 

authority is one that is pointed to by many documents. 

Relevancy Scoring of Nodes in the Neighborhood Transitively, an even better hub is one that points to many 

Graph good authorities, and an even better authority is one that is 

pointed to by many good hubs. 

We next score 220 the content of the pages represented by „ Bharal et al cjted ab have come wjlh severa , 

the graph 211 with respect to a topic 202. We extract the improved algorithms ^ provide more accurate results than 

topic 202 from the initial page 201. Kleinberg's algorithm, and any of these could be used as in 

Scoring can be done using well known retrieval tech- S ( e p 240. 

niques. For example, in the Salton & Buckley model, the [f a ^ nodc has dominaled (nc computation B a hub 

content of the represented pages 211 and the topic 202 can 30 node> that is> exerted « wdm influence", then it is sometimes 

be regarded as vectors in an n-dimensional vector space, beneficial t0 rem0V6 that node from ttl6 neighborhood graph, 

where n corresponds to the number of unique terms m the and fepeat (he scorfng phase ^ on the graph wjm , he Dode 

data set. removed. One way of detecting when this undue influence 

A vector matching operation based on the cosine of the has been exerted is when a single node has a large fraction 

angle between vectors is used to produce scores 203 that 35 of me total hub scores of all the nodes (e.g., more than 95% 

measure similarity. Please see, Salton et al., "Term- of the total hub scores is attributed to a single node). Another 

Weighting Approaches in Automatic Text Retrieval," Infor- mean s determines if the node with the highest hub score has 

mation Processing and Management, 24(5), 513-23, 1988. A morc than three times the hub score of the next highest hub 

probabilistic model is described by Croft et al. in "Using score . Other means of determining undue influence are 

Probabilistic Models of Document Retrieval without Rel- 40 p 0ss iblc. 
evance Feedback," Documentation, 35(4), 285-94, 1979. 

For a survey of ranking techniques in Information Retrieval Differences with the Prior Art 

see Frakes et al., "Information Retrieval: Data Structures & rt , 

Algorithms," Chapter 14-'Ranking Algorithms,' Prentice- ? ur method dlff ° re . fro , m P rl0r a " ,n J the building 

Hall NJ 1992 45 and pruning steps. A simple prior art building method treated 

' * . , the n-graph as an undirected graph and used any page within 

Our topic vector can be determined as the term vector of a dis[ance R (Q constnlcl the h Refinements to this 

the initial page 201 or as a vector sum of the term vector of method 00Mider the h as direc , ed and a „ owed a certain 

the initial selected page and some function of the term number of backward h link traversals as rt of the 

vectors of all the pages presented in the neighborhood graph neighborhood grapn construction. Notice, this refinement 

211. One such function could simp ywe lg ht the term vectors jres backwards ^nectivity information that is not 

of each of the pages equally, while another more complex direct , m in , he Web toemselves. 

function would give more weight to the term vectors of . . , , 

... 4 ° 11 j • * is c 4U i * j This information can be provided by a server 150, such as 

pages that are at a smaller distance K from the selected page . . r , . , . , T ^ 

„ . . , , ~ te a connectivity server or a search engine database, see U.S. 

201. Scoring 220 results in a scored graph 215. , ./ . „ . T rxr , , r ^~ • 

6 & v 55 patent application Ser. No. 09/037,350 "Connectivity 

Pruning Nodes in the Scored Neighborhood Graph Server " filed b y Broder el aL on Mar " 1998 > now U S " 

Pat. No. 6,073,135. Typical values of K can be 2 or 3. 

After the graph has been scored, the scored graph 215 is Alternatively, K can be determined dynamically, depending 

"pruned" 230 to produce a pruned graph 216. Here, pruning on the size of the neighborhood graph, for example, first try 

means removing those nodes and links from the graph that 60 to build a graph for K=2, and if this graph is not considered 

are not "similar." There are a variety of approaches which large enough, use a larger value for K. 

can be used as the threshold for pruning, including median There are two differences in our method. First, we start 

score, absolute threshold, or a slope-based approach. with a single Web page as input, rather than the result set 

In addition, content analysis can be used to guide the produced by a search engine query. The second difference 
neighborhood graph construction process by extending the 65 deals with how the initial neighborhood graph 211 is con- 
search out to larger distances of K for pages whose contents structed. Kleinberg includes all pages that have a directed 
are closely related to the original page, and cutting off the path of length K from or to the initial set. 
In contrast, we look at the Web graph as an undirected
graph and include all pages that are K undirected links away
from our initial selected page. This has the benefit of
including pages that can be reached by an "up-down" path
traversal of the graph, such as pages that are both indexed
by the same directory page, but which are not reachable
from each other using just a directed path. In some cases we
choose to specify the type of paths allowed explicitly, e.g.,
only F, B, FB, BF as described above.
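
By way of illustration only, the path-restricted construction for K=2 can be sketched as follows. The sketch assumes the forward links are already available as an in-memory map; the names `build_neighborhood`, `links`, and `STOP_URLS` are hypothetical, and the stop list holds only the one example URL given in the description.

```python
from collections import defaultdict

# Hypothetical stop list; the description names "www.altavista.com" as an example.
STOP_URLS = {"www.altavista.com"}


def build_neighborhood(links, start, stop_urls=STOP_URLS):
    """Collect pages reachable from `start` by the paths B, F, BF, and FB.

    `links` maps each page to the set of pages it points to (forward links).
    Pages reachable only by FF or BB paths are left out, and stop URLs are
    never added to the graph.
    """
    # Invert the forward links to obtain backward links.
    back = defaultdict(set)
    for src, dsts in links.items():
        for dst in dsts:
            back[dst].add(src)

    def forward(page):
        return {p for p in links.get(page, ()) if p not in stop_urls}

    def backward(page):
        return {p for p in back.get(page, ()) if p not in stop_urls}

    f_pages = forward(start)      # "F"
    b_pages = backward(start)     # "B"
    nodes = {start} | f_pages | b_pages
    for parent in b_pages:        # "BF": pages listed by a shared parent
        nodes |= forward(parent)
    for child in f_pages:         # "FB": other pages pointing to the same child
        nodes |= backward(child)

    # Keep only edges whose endpoints both survived.
    edges = {(s, d) for s in nodes for d in links.get(s, ()) if d in nodes}
    return nodes, edges


links = {
    "dir": {"a", "b", "www.altavista.com"},
    "a": {"x"},
    "b": set(),
    "x": {"y"},
    "c": {"x"},
    "root": {"dir"},
}
nodes, edges = build_neighborhood(links, "a")
```

In this toy graph, page "b" enters through the shared directory page "dir" (a BF path) and page "c" through the shared target "x" (an FB path), while "y" (FF) and "root" (BB) are left out.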

In the presence of useful hub pages, pages that point to
many related pages, our approach will include all of the
related pages referenced by the hub which might be similar
to the selected page 201 in our neighborhood graph.
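
Because a hub or a popular page can contribute thousands of neighbors, the construction described earlier limits each page to M incoming or outgoing links. A minimal sketch of that filtering is given below. The function names are hypothetical, and reading "the M links that surround the link L" as a window roughly centered on L is an assumption of this sketch.

```python
import heapq

M = 50  # value of M given for the preferred implementation


def pick_incoming(pointing_pages, in_degree, m=M):
    """Choose at most `m` of the pages pointing to P, preferring high in-degree."""
    return heapq.nlargest(m, pointing_pages, key=lambda p: in_degree.get(p, 0))


def pick_outgoing(out_links, via_link, m=M):
    """Choose at most `m` outgoing links in a window around `via_link`.

    `out_links` is the list of the page's links in document order, and
    `via_link` is the link L by which the page was reached going backwards.
    """
    if len(out_links) <= m:
        return list(out_links)
    center = out_links.index(via_link)
    # Clamp the window so it stays inside the list of links.
    start = max(0, min(center - m // 2, len(out_links) - m))
    return out_links[start:start + m]
```

For example, with `m=4` and ten links numbered 0 to 9, reaching the page through link 7 keeps links 5 to 8.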

Pruning

Our method differs from the Kleinberg method because
in the Kleinberg method no pruning of the neighborhood
graph is performed. Bharat et al. improve the Kleinberg
method by pruning the graph to leave a subset of pages
that are fed to the ranking step to yield more accurate results.

However, because we start with a single Web page, rather
than with results of a query, we do not have an initial query
against which to measure the relevance of the related pages.
Instead, we use the content of the initial page, and optionally
the content of other pages in the neighborhood graph to
arrive at a topic vector.
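
The content scoring and pruning described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation as filed: raw term counts stand in for a real term-weighting scheme, neighbor pages are weighted equally by a hypothetical `neighbor_weight`, and the median score serves as the pruning threshold.

```python
import math
from collections import Counter
from statistics import median


def term_vector(text):
    """Raw term-frequency vector; a stand-in for a real weighting scheme."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def topic_vector(initial_text, neighbor_texts=(), neighbor_weight=0.0):
    """Term vector of the initial page plus equally weighted neighbor vectors."""
    topic = Counter(term_vector(initial_text))
    for text in neighbor_texts:
        for term, count in term_vector(text).items():
            topic[term] += neighbor_weight * count
    return topic


def prune_by_median(pages, topic):
    """Keep pages whose similarity to the topic is at least the median score."""
    scores = {url: cosine(term_vector(text), topic) for url, text in pages.items()}
    cutoff = median(scores.values())
    return {url: s for url, s in scores.items() if s >= cutoff}
```

Calling `prune_by_median(pages, topic_vector(initial_text))` returns the surviving pages with their similarity scores, which can then be passed on to connectivity scoring.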

Scoring 

Our method differs from Kleinberg's algorithm in the
scoring phase in that we detect cases where a node has
exerted "undue influence" on the computation of hub scores.
In this case, we remove the node from the graph and repeat
the scoring computation without this node in the graph. This
change tends to produce a more desirable ordering of related
pages where highly rated pages are referred to by more than
one page. Kleinberg's algorithm does not include any such
handling of nodes with undue influence.
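
A minimal sketch of this scoring phase follows. It assumes the pruned graph is given as a set of nodes and a set of directed (source, target) edges, uses a fixed number of iterations with unit-length normalization, and combines the two detection tests from the description (more than 95% of the total hub score, or more than three times the next highest hub score) with a logical "or". The function names and the iteration count are assumptions of this sketch.

```python
import math


def hits(nodes, edges, iterations=50):
    """Iteratively compute normalized hub and authority scores."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing to the node.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages the node points to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


def dominant_hub(hub, fraction=0.95, ratio=3.0):
    """Return the hub that exerts undue influence, or None if there is none."""
    ranked = sorted(hub, key=hub.get, reverse=True)
    if len(ranked) < 2:
        return None
    top, runner_up = hub[ranked[0]], hub[ranked[1]]
    total = sum(hub.values())
    if total > 0 and (top > fraction * total or top > ratio * runner_up):
        return ranked[0]
    return None


def score_without_dominant_hub(nodes, edges):
    """Score the graph, and rescore once if a single hub dominated."""
    hub, auth = hits(nodes, edges)
    culprit = dominant_hub(hub)
    if culprit is not None:
        nodes = nodes - {culprit}
        edges = {(s, d) for s, d in edges if culprit not in (s, d)}
        hub, auth = hits(nodes, edges)
    return hub, auth, culprit
```

Here `nodes` must be a set; the third return value reports which node, if any, was removed before rescoring.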

Advantages and Applications

Our invention enables automatic identification of Web
pages related to a single Web page. Thus, if a user locates
just one page including an interesting topic, then other pages
related to the topic are easily located. According to the
invention, the relationship is established through the use of
connectivity and content analysis of the page and nearby
pages in the Web neighborhood.

By omitting the content analysis steps of our method, the
method is able to identify related URLs for the selected page
201 solely through connectivity information. Since this
information can be quickly provided by means of a connectivity
server 150, the set of related pages can be identified
without fetching any pages or examining the contents of any
pages.
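
One step that likewise needs only connectivity information is the link-based detection of near-duplicate pages described earlier. A minimal sketch is given below. The description does not state a value for the threshold Q or for what counts as "a large fraction", so the defaults `q=10` and `overlap=0.9`, like the function names, are assumptions chosen for illustration.

```python
def near_duplicates(out_a, out_b, q=10, overlap=0.9):
    """True if both pages have more than `q` outgoing links and a large
    fraction of each page's links also appears on the other page."""
    a, b = set(out_a), set(out_b)
    if len(a) <= q or len(b) <= q:
        return False
    shared = len(a & b)
    return shared / len(a) >= overlap and shared / len(b) >= overlap


def collapse_duplicates(out_links, q=10, overlap=0.9):
    """Map every page to a representative so near duplicates share one node."""
    representative = {}
    kept = []
    for page in sorted(out_links):
        for rep in kept:
            if near_duplicates(out_links[page], out_links[rep], q, overlap):
                representative[page] = rep
                break
        else:
            # No earlier page matched, so this page represents itself.
            representative[page] = page
            kept.append(page)
    return representative
```

Pages are visited in sorted order so that the choice of representative is deterministic; any other stable order would serve equally well.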

One application of this invention allows a Web browser in
a client computer to provide a "Related Pages" option,
whereby users can quickly access any of the related pages.
Another application is in a server computer that implements
a Web search engine. There, a similar option allows a user
to list just related pages, instead of the entire result set of a
search.

It is understood that the above-described embodiments are
simply illustrative of the principles of the invention. Various
other modifications and changes may be made by those
skilled in the art which will embody the principles of the
invention and fall within the spirit and scope thereof.



We claim: 

1. A method for identifying pages related to an initial 
page, comprising: 

identifying a plurality of pages linked to the initial page; 

representing the plurality of pages as a graph of nodes; 

scoring the plurality of pages on connectivity of said 
plurality of pages to the initial page to generate a 
connectivity score for each of said plurality of pages; 

removing from the graph of nodes pages with an undue 
influence on the scoring of other pages in the plurality 
of pages, wherein a page has the undue influence on the 
scoring of other pages in the plurality of pages, when 
said page has a score greater than a predetermined 
fraction of a total connectivity score, said total connec- 
tivity score computed by summing connectivity scores 
of the plurality of pages; 

re-scoring remaining pages represented in the graph of 
nodes; and 

selecting a subset of the remaining pages represented in 
the graph of nodes that have connectivity scores greater 
than a first predetermined threshold as the pages related 
to the initial page. 

2. The method of claim 1 wherein the initial page is 
selected by specifying an address of the page. 

3. The method of claim 1 wherein the initial page is 
selected by a user interface. 

4. The method of claim 1 wherein the connectivity of the 
plurality of pages are scored on content by measuring the 
similarity of the plurality of pages to a topic. 

5. The method of claim 4 wherein the topic is extracted 
from the initial page. 

6. The method of claim 4 wherein the topic is extracted 
from the plurality of pages represented in the graph. 

7. The method of claim 1 wherein any of the plurality of 
pages that are linked in any direction to the initial page are 
represented in the graph. 

8. The method of claim 7 wherein the plurality of pages 
represented in the graph are linked to the initial page by a 
predetermined number of links. 

9. The method of claim 7 wherein each page represented 
in the graph depends on a path from each page to the initial 
page, the path including the length of the path and the 
direction of edges on the path. 

10. The method of claim 7 wherein the plurality of pages 
represented in the graph as nodes are linked to the node 
representing the initial page by a number of edges that is 
determined dynamically. 

11. The method of claim 1 performed in a client computer. 

12. The method of claim 1 performed in a server com- 
puter. 

13. The method of claim 1, wherein the predetermined 
fraction of the total connectivity score is equal to ninety-five 
percent of the total connectivity score. 

14. The method of claim 1, wherein the step of identifying 
a plurality of pages linked to the initial page comprises 
identifying a plurality of pages linked to the initial page by 
not more than a defined number of links, wherein when said 
defined number of links is set to two, a page is linked to said 
initial page only if said page is reachable from said initial 
page by going one link backwards, said page is reachable 
from said initial page by going one link forwards, said page 
is reachable from said initial page by going one link back- 
wards followed by one link forwards, or said page is 
reachable from said initial page by going one link forward 
followed by one link backwards. 
15. A method for identifying pages related to an initial
page, comprising: 

identifying a plurality of pages linked to the initial page; 

representing the plurality of pages as a graph of nodes; 

scoring the plurality of pages on connectivity of said 
plurality of pages to the initial page to generate a 
connectivity score for each of said plurality of pages; 

removing from the graph of nodes pages with an undue 
influence on the scoring of other pages in the plurality
of pages, wherein a page has the undue influence on the 
scoring of other pages in the plurality of pages when 
said page has a score greater than each score of all other 
pages in the plurality of pages and said score is at least 
three times greater than a next highest score of another
page in said plurality of pages; 

re-scoring remaining pages represented in the graph of 
nodes; and 

selecting a subset of the remaining pages represented in
the graph of nodes that have connectivity scores greater 
than a first predetermined threshold as the pages related 
to the initial page. 

16. The method of claim 15 wherein the initial page is 
selected by specifying an address of the page.

17. The method of claim 15 wherein the initial page is 
selected by a user interface. 

18. The method of claim 15 wherein the connectivity of 
the plurality of pages are scored on content by measuring the
similarity of the plurality of pages to a topic. 

19. The method of claim 18 wherein the topic is extracted 
from the initial page. 

20. The method of claim 18 wherein the topic is extracted 
from the plurality of pages represented in the graph.

21. The method of claim 15 wherein any of the plurality 
of pages that are linked in any direction to the initial page are 
represented in the graph. 

22. The method of claim 21 wherein the plurality of pages 
represented in the graph are linked to the initial page by a
predetermined number of links. 

23. The method of claim 21 wherein each page represented
in the graph depends on a path from each page to the
initial page, the path including the length of the path and the
direction of edges on the path.

24. The method of claim 21 wherein the plurality of pages 
represented in the graph as nodes are linked to the node 
representing the initial page by a number of edges that is 
determined dynamically. 

25. The method of claim 15 performed in a client computer.

26. The method of claim 15 performed in a server 
computer. 

27. The method of claim 15, wherein the step of identifying
a plurality of pages linked to the initial page comprises
identifying a plurality of pages linked to the initial page by
not more than a defined number of links, wherein when said
defined number of links is set to two, a page is linked to said
initial page only if said page is reachable from said initial
page by going one link backwards, said page is reachable
from said initial page by going one link forwards, said page
is reachable from said initial page by going one link backwards
followed by one link forwards, or said page is
reachable from said initial page by going one link forward
followed by one link backwards.

28. A computer program product readable by a computing 
system and encoding a computer program of instructions for
executing a computer process for identifying pages related
to an initial page, said computer process comprising: 

identifying a plurality of pages linked to the initial page; 

representing the plurality of pages as a graph of nodes; 

scoring the plurality of pages on connectivity of said 
plurality of pages to the initial page to generate a 
connectivity score for each of said plurality of pages; 

removing from the graph of nodes pages with an undue 
influence on the scoring of other pages in the plurality 
of pages, wherein a page has the undue influence on the 
scoring of other pages in the plurality of pages, when 
said page has a score greater than a predetermined 
fraction of a total connectivity score, said total 
connectivity score computed by summing connectivity scores of 
the plurality of pages; 

re-scoring remaining pages represented in the graph of 
nodes; and 

selecting a subset of the remaining pages represented in 
the graph of nodes that have connectivity scores greater 
than a first predetermined threshold as the pages related 
to the initial page. 
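
The steps of claim 28 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not fix a scoring function, so the sketch takes one as a parameter, and `degree_score` (a simple link count) is a hypothetical stand-in. The names `related_pages`, `undue_fraction`, and `threshold` are likewise invented for the example.

```python
def degree_score(nodes, edges):
    """Hypothetical stand-in score: number of links touching each page."""
    scores = {p: 0 for p in nodes}
    for u, v in edges:
        if u in scores and v in scores:
            scores[u] += 1
            scores[v] += 1
    return scores


def related_pages(nodes, edges, score_fn, undue_fraction, threshold):
    """Score, drop pages with undue influence, re-score, then select."""
    nodes = set(nodes)

    # Score every page and sum the scores to get the total.
    scores = score_fn(nodes, edges)
    total = sum(scores.values())

    # A page has undue influence when its score exceeds the
    # predetermined fraction of the total connectivity score.
    dominant = {p for p, s in scores.items() if s > undue_fraction * total}

    # Remove those pages (and their edges) from the graph.
    remaining = nodes - dominant
    kept = [(u, v) for u, v in edges if u in remaining and v in remaining]

    # Re-score what is left and keep pages above the threshold.
    rescored = score_fn(remaining, kept)
    return {p for p, s in rescored.items() if s > threshold}
```

With a hub page that links to every other page, the hub accounts for a large share of the total score and is removed before the remaining pages are re-scored and selected.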

29. The computer program product of claim 28 wherein 
the computer process selects the initial page by specifying 
an address of the page. 

30. The computer program product of claim 28 wherein 
the computer process selects the initial page by receiving 
input from a user interface. 

31. The computer program product of claim 28 wherein 
the computer process scores connectivity of the plurality of 
pages on content by measuring the similarity of the plurality 
of pages to a topic. 

32. The computer program product of claim 31 wherein 
the computer process extracts the topic from the initial page. 

33. The computer program product of claim 31 wherein 
the computer process extracts the topic from the plurality of 
pages represented in the graph. 

34. The computer program product of claim 28 wherein 
the computer process represents the plurality of pages that 
are linked in any direction to the initial page in the graph. 

35. The computer program product of claim 34 wherein 
the computer process links the plurality of pages represented 
in the graph to the initial page by a predetermined number 
of links. 

36. The computer program product of claim 34 wherein 
each page represented in the graph depends on a path from 
each page to the initial page, the path including the length of 
the path and the direction of edges on the path. 

37. The computer program product of claim 34 wherein 
the computer process links the plurality of pages represented 
in the graph as nodes to the node representing the initial page 
by a number of edges that is determined dynamically. 

38. The computer program product of claim 28 wherein 
the computer process is performed in a client computer. 

39. The computer program product of claim 28 wherein 
the computer process is performed in a server computer. 

40. The computer program product of claim 28, wherein 
the first predetermined threshold is equal to ninety-five 
percent of the total connectivity score. 

41. The computer program product of claim 28, wherein 
the computer process step of identifying a plurality of pages 
linked to the initial page comprises identifying a plurality of 
pages linked to the initial page by not more than a defined 
number of links, wherein when said defined number of links 
is set to two, a page is linked to said initial page only if said 
page is reachable from said initial page by going one link 
backwards, said page is reachable from said initial page by 
going one link forwards, said page is reachable from said 
initial page by going one link backwards followed by one 
link forwards, or said page is reachable from said initial page 
by going one link forward followed by one link backwards. 

42. A computer program product readable by a computing 
system and encoding a computer program of instructions for 
executing a computer process for identifying pages related 
to an initial page, said computer process comprising: 

identifying a plurality of pages linked to the initial page; 

representing the plurality of pages as a graph of nodes; 

scoring the plurality of pages on connectivity of said 
plurality of pages to the initial page to generate a 
connectivity score for each of said plurality of pages; 

removing from the graph of nodes pages with an undue 
influence on the scoring of other pages in the plurality 
of pages, wherein a page has the undue influence on the 
scoring of other pages in the plurality of pages when 
said page has a score greater than each score of all other 
pages in the plurality of pages and said score is at least 
three times greater than a next highest score of another 
page in said plurality of pages; 

re-scoring remaining pages represented in the graph of 
nodes; and 

selecting a subset of the remaining pages represented in 
the graph of nodes that have connectivity scores greater 
than a first predetermined threshold as the pages related 
to the initial page. 

43. The computer program product of claim 42 wherein 
the computer process selects the initial page by specifying 
an address of the page. 

44. The computer program product of claim 42 wherein 
the computer process selects the initial page by receiving 
input from a user interface. 

45. The computer program product of claim 42 wherein 
the computer process scores connectivity of the plurality of 
pages on content by measuring the similarity of the plurality 
of pages to a topic. 

46. The computer program product of claim 45 wherein 
the computer process extracts the topic from the initial page. 

47. The computer program product of claim 45 wherein 
the computer process extracts the topic from the plurality of 
pages represented in the graph. 

48. The computer program product of claim 42 wherein 
the computer process represents the plurality of pages that 
are linked in any direction to the initial page in the graph. 

49. The computer program product of claim 48 wherein 
the computer process links the plurality of pages represented 
in the graph to the initial page by a predetermined number 
of links. 

50. The computer program product of claim 48 wherein 
each page represented in the graph depends on a path from 
each page to the initial page, the path including the length of 
the path and the direction of edges on the path. 

51. The computer program product of claim 48 wherein 
the computer process links the plurality of pages represented 
in the graph as nodes to the node representing the initial page 
by a number of edges that is determined dynamically. 

52. The computer program product of claim 42 wherein 
the computer process is performed in a client computer. 

53. The computer program product of claim 42 wherein 
the computer process is performed in a server computer. 

54. The computer program product of claim 42, wherein 
the computer process step of identifying a plurality of pages 
linked to the initial page comprises identifying a plurality of 
pages linked to the initial page by not more than a defined 
number of links, wherein when said defined number of links 
is set to two, a page is linked to said initial page only if said 
page is reachable from said initial page by going one link 
backwards, said page is reachable from said initial page by 
going one link forwards, said page is reachable from said 
initial page by going one link backwards followed by one 
link forwards, or said page is reachable from said initial page 
by going one link forward followed by one link backwards. 

* * * * * 
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